{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Lab 8 - Computing probabilities\n", "\n", "This lab will use the 311 service request dataset from NYC Open Data, which contains data about complaints and service requests (e.g., scheduling an electronic waste pickup) made by calling 311 from 2010 to the present. Each row corresponds to one complaint or request. \n", "\n", "In this lab, you will learn how to estimate probabilities from data.\n", "\n", "To download and filter the data:\n", "\n", "1. Go to [https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) and click \"View Data\".\n", "2. We will filter the data to only contain complaints made on Feb. 19, 2019:\n", " \n", " a. If necessary, click on \"Filter\", and then click on \"Add new filter condition\".\n", " \n", " b. Select the column \"Created Date\" and change \"is\" to \"is between\".\n", " \n", " c. For the first date, select 02/19/2019 12:00:00 AM\n", " \n", " d. For the second date, select 02/20/2019 12:00:00 AM\n", " \n", " e. Check the box to the left of the first date. The data should change, so that only those complaints created on Feb. 19, 2019 show.\n", " \n", " f. To download the filtered data, click Export, then CSV.\n", " \n", " g. Upload the downloaded data file to Jupyter Hub. \n", " \n", "As usual, we will import the matplotlib and pandas packages, and set plots to appear in the Jupyter notebook. The final line makes pandas show all columns when a dataframe is displayed." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline\n", "\n", "# show all columns when displaying the dataframe\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, write code to load your 311 data into a dataframe called `calls`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that the dataframe was created properly by displaying it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More bar charts: only plotting the top categories\n", "\n", "We will focus on the different types of complaints. First, let's explore this data by creating a bar chart of the number of each type of complaint. Write your code to do this below. If you need a reminder, we also made bar charts in Labs 3 and 7." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", " counts_variable = dataframe_name[\"column_name\"].value_counts()\n", "counts_variable.plot(kind = \"bar\")\n", "
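
As a small illustration of the pattern above, here is a sketch on made-up data. The dataframe `toy` and its values are invented for this example and are not taken from the 311 dataset. Note that `value_counts()` lists the categories from most common to least common.

```python
import pandas as pd

# Made-up data for illustration only; not from the 311 dataset
toy = pd.DataFrame({
    'Complaint Type': ['Noise', 'Rodent', 'Noise', 'Illegal Parking', 'Noise', 'Rodent']
})

# Count how many times each category appears, most common first
toy_counts = toy['Complaint Type'].value_counts()
print(toy_counts)
```

Calling `toy_counts.plot(kind = 'bar')` in a notebook would then draw one bar per category.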
\n", "\n", "What do you notice about your bar chart? How useful is it?\n", "\n", "When there are a lot of categories, we can plot only the top 10 (or top 5, or top 20, etc.) using the `head()` function. Run the following code, changing the name of the variable holding the complaint counts (`complaint_counts` here) to match your code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "complaint_counts.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What did it do? What happens if you change the parameter 10 to 5? To 15?\n", "\n", "We can either save the top 10 complaints in a variable and plot them as a bar chart:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top10_complaints = complaint_counts.head(10)\n", "top10_complaints.plot(kind = \"bar\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can string the two functions together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "complaint_counts.head(10).plot(kind = \"bar\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use either method. Can you create a bar chart of the top 20 complaints?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the most common complaint type? \n", "\n", "### Computing probabilities\n", "\n", "Next we will compute the probability that a complaint or request is about illegal parking. What's the formula for computing this probability?\n", "\n", "$$\\text{Probability that a complaint is about illegal parking} = \\frac{\\text{# of complaints about illegal parking}}{\\text{total # of complaints}}$$\n", "\n", "First we will count the total number of complaints, which is just the number of rows in the dataframe. 
There are two ways to do this: `len(calls)` or `calls.shape[0]`. Try them both below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are square brackets after `calls.shape` instead of parentheses because `shape` is not a function but a property of the dataframe. \n", "\n", "Try typing `calls.shape` below and running it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think 41 refers to? We can get the number of columns with `calls.shape[1]`. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the total number of calls (rows) again, and this time store it in the variable `num_calls` so we can use it later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we have to count the number of calls about illegal parking. First we use a filter to identify these rows:\n", "\n", "`parking_filter = calls[\"Complaint Type\"] == \"Illegal Parking\"`\n", "\n", "This code checks each row of the `Complaint Type` column, and stores `True` in the variable `parking_filter` for the rows that read `Illegal Parking` and `False` for all other rows. Type and run this code below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the contents of the `parking_filter` variable below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " parking_filter\n", "
\n", "\n", "In `parking_filter`, each `True` value counts as 1 and each `False` value counts as 0 when we do arithmetic. Therefore, to count the number of `True` values, we can add them up with the `sum()` function. Type the code `parking_filter.sum()` below and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save this count as the variable `num_illegal_parking`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " num_illegal_parking = parking_filter.sum()\n", "
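
To see why summing a filter counts the matching rows, here is a minimal sketch on made-up data. The names `toy`, `toy_filter`, and `toy_matches` are invented for this example and are not part of the lab's 311 data.

```python
import pandas as pd

# Made-up data for illustration only; not from the 311 dataset
toy = pd.DataFrame({
    'Complaint Type': ['Noise', 'Illegal Parking', 'Rodent', 'Illegal Parking']
})

# One True/False value per row: True where the complaint type matches
toy_filter = toy['Complaint Type'] == 'Illegal Parking'
print(toy_filter.tolist())   # [False, True, False, True]

# Each True adds 1 and each False adds 0, so the sum is the number of matching rows
toy_matches = toy_filter.sum()
print(toy_matches)           # 2
```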
\n", "\n", "Finally, to compute the probability that a 311 complaint is about illegal parking, we divide the number of illegal parking complaints (stored in `num_illegal_parking`) by the total number of calls (stored in `num_calls`). Recall we learned how to do math in Python in Lab 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " num_illegal_parking/num_calls\n", "
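
A probability is a number between 0 and 1; multiplying it by 100 gives a percentage. The sketch below walks through the whole calculation on made-up data, so the variable names and values are assumptions for illustration and will not match your 311 results.

```python
import pandas as pd

# Made-up data for illustration only; not from the 311 dataset
toy = pd.DataFrame({
    'Complaint Type': ['Noise', 'Illegal Parking', 'Rodent', 'Illegal Parking', 'Noise']
})

# Total number of rows (5 here)
toy_total = len(toy)

# Number of rows about illegal parking (2 here)
toy_parking = (toy['Complaint Type'] == 'Illegal Parking').sum()

# Probability = matching rows divided by all rows
toy_probability = toy_parking / toy_total
print(toy_probability)

# The same value expressed as a percentage, rounded to one decimal place
toy_percentage = 100 * toy_probability
print(round(toy_percentage, 1))
```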
\n", "\n", "What percentage of 311 complaints are about illegal parking? Did this number surprise you?\n", "\n", "Here's another example. Let's compute the probability that an illegal parking complaint is about a blocked hydrant. What's the formula?\n", "\n", "$$\\text{Probability an illegal parking complaint is about a blocked hydrant} = \\frac{\\text{# of illegal parking complaints about blocked hydrants}}{\\text{total # of illegal parking complaints}}$$\n", "\n", "We computed the number of illegal parking complaints above, so we just need to compute the number of blocked hydrant complaints. A blocked hydrant complaint has `Blocked Hydrant` in the `Descriptor` column. \n", "\n", "Can you figure out how to make a filter for blocked hydrant calls? \n", "\n", "Hint: Take the filter we made above (`parking_filter = calls[\"Complaint Type\"] == \"Illegal Parking\"`) and change some of the parts of it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " hydrant = calls[\"Descriptor\"] == \"Blocked Hydrant\"\n", "
\n", "\n", "Next, write code below to count the number of `True` values in your hydrant filter.\n", "\n", "Hint: Look back above to see how we did this with the `parking_filter` filter." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " num_hydrant = hydrant.sum()\n", "num_hydrant\n", "
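
In the formula above, both the numerator and the denominator count only illegal parking complaints. Here is a sketch on made-up data, with all names and values invented for illustration. It combines two filters with `&`, which keeps a row only when both filters are `True`; this extra step is an assumption that goes beyond the lab's own approach and only matters if a descriptor could appear under more than one complaint type.

```python
import pandas as pd

# Made-up data for illustration only; not from the 311 dataset
toy = pd.DataFrame({
    'Complaint Type': ['Illegal Parking', 'Illegal Parking', 'Noise', 'Illegal Parking', 'Rodent'],
    'Descriptor': ['Blocked Hydrant', 'Double Parked', 'Loud Music', 'Blocked Hydrant', 'Mouse Sighting'],
})

toy_parking = toy['Complaint Type'] == 'Illegal Parking'
toy_hydrant = toy['Descriptor'] == 'Blocked Hydrant'

# True only for rows that are illegal parking complaints AND about a blocked hydrant
toy_both = toy_parking & toy_hydrant

# Divide by the number of illegal parking complaints, not by all complaints
toy_conditional = toy_both.sum() / toy_parking.sum()
print(toy_conditional)
```

Here 2 of the 3 illegal parking rows are about a blocked hydrant, so the result is two thirds.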
\n", "\n", "Finally, do the division to compute the probability that an illegal parking complaint is about a blocked hydrant. You may need to save some of your previous calculations as variables if you didn't already." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What percentage of complaints about illegal parking are about blocked hydrants? Is this what you expected?\n", "\n", "#### Challenges:\n", "- What is the probability that a 311 complaint is about no heat or hot water? A complaint about no heat or hot water is listed as `HEAT/HOT WATER` in the `Complaint Type` column.\n", "- What is the probability that a 311 complaint is about rodents (listed as `Rodent` in the `Complaint Type` column)?\n", "- What is the probability that a 311 complaint about rodents is about mice (listed as `Mouse Sighting` in the `Descriptor` column)?\n", "- Bonus: Can you figure out how to make a bar chart of the different kinds of illegal parking complaints? Hint: Make a new dataframe containing only the illegal parking complaints. There is code for making a new dataframe (`solo_artist`) from a filter in Lab 7. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }